Introduction

According to the Center for Disease Control and Prevention (CDC), cardiovascular or heart diseases account for 1 in every 4 deaths in the United States annually. The estimated costs to the nation’s economy amount to about $363 billion annually, which include hospitalizations, after services health care needs as well as lost productivity due to death and disability (SS Virani, 2021). The CDC also states that given the lifestyle habits of an average American, about half of the population (47%) are reported to have “1 of the 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking”. However, the risk factors for developing heart diseases are vast and diverse and tend to vary between most racial and ethnic groups in the United States.

Therefore, there is an immediate need to detect the major risk factors early on, so that a wellness-oriented healthcare system can be instituted. Ideally, such a system would assess the intensity of the risk factors (compared to a baseline) and focus efforts on preventing the occurrence of developing heart diseases in the first place rather than minimizing second occurrences (Liau et al, 2010; Giampaoli et al, 2005 ).

The beginnings of the current understanding of the interconnected role of risk factors was first discussed in the Framingham Heart Study. Started in 1948, the study’s original goal was “ to identify common factors or characteristics that contribute to cardiovascular disease” (Hong 2018). However, today the FHS has morphed into a multigenerational study gathering genetic information that help analyze family patterns of certain diseases, including cardiovascular ailments (National Heart, Lung and Blood Institute). Jumping off from this, current research in the field has focused on using precision medicine algorithms along with computational development technology to assess a patient’s likelihood of developing heart disease given certain lifestyle factors. Most of the studies have focused largely on Caucasian individuals as a sample and indicating that there is a gap in the usefulness of current risk assessment tools working for individuals of different races.

In this paper, we aim to understand the different risk factors that affect the occurrence of heart diseases among the US population, so that we may be able to see some of the intermingling between different risk factors.

Description of Dataset

The original data set can be retrieved from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) annual health survey. The survey interviews over four hundred thousand individuals in the USA asking questions regarding the risk factors of heart disease. Conducted in 2020 (published on February 15, 2021), the original data set has been cleaned from its original 300 variables to 18 variables by Kamil Pytlak and can be accessed on Kaggle .

# Lets load the appropriate libraries
library(tidyverse)
library(dplyr)
library(funModeling)
library(ezids)
library(ggplot2)
library(ggcorrplot)
library(corrr)
library(corrplot)
library(ROCR)
library(grid)
library(broom)
library(caret)
library(tidyr)
library(scales)
# ggthemr is not on CRAN. Need to install from console using 
#devtools::install_github('cttobin/ggthemr')
#library(ggthemr)
#library(ggthemes)
library(gridExtra)
library(data.table)
library(regclass)

# Load the Data 
heart <-read.csv("heart_2020_cleaned.csv")

# Descriptive Statistics  
status (heart)

From the table we can see that physical and mental health have a higher percentage of zeros. This means that just over 70% of the people in our data set report being physically sick in a 30 day period while about 60% of the respondents report feeling mentally unwell in that same time period.

We can also see that there are no N/A values in our data set. This is not surprising given that our data was cleaned before being retrieved.

Below are the plots showing the means of our numerical variables. We can see that, on average, people with heart disease have a slightly higher BMI than those who don’t have heart disease. However, the biggest difference is the number of days that a respondent feels physically unwell. Respondents who have heart disease reported feeling physically unwell about 8 days per month, while healthy people felt unwell for only 3. Poor mental health in a 30 day period is slightly higher for people with heart disease, than for people without.

# Finding the mean of numerical variables grouped by heart disease
plot_num(heart)

heart %>%
  group_by(HeartDisease) %>%
  summarise_at(vars("BMI", 
                    "PhysicalHealth", 
                    "MentalHealth",
                    "SleepTime"), mean)
## # A tibble: 2 × 5
##   HeartDisease   BMI PhysicalHealth MentalHealth SleepTime
##   <chr>        <dbl>          <dbl>        <dbl>     <dbl>
## 1 No            28.2           2.96         3.83      7.09
## 2 Yes           29.4           7.81         4.64      7.14
# Changing Diabetes bordeline and pregnancy categories to Yes/No

heart$Diabetic<- replace (heart$Diabetic, heart$Diabetic== "No, borderline diabetes" , "Yes")
heart$Diabetic[heart$Diabetic== "Yes (during pregnancy)"] <- "No"  

# Changing all the categorical variables to numerical dummies 

heart$HeartDisease<-ifelse(heart$HeartDisease=="Yes",1,0)
heart$Smoking<-ifelse(heart$Smoking=="Yes",1,0)
heart$AlcoholDrinking<-ifelse(heart$AlcoholDrinking=="Yes",1,0)
heart$Stroke<-ifelse(heart$Stroke=="Yes",1,0)
heart$DiffWalking<-ifelse(heart$DiffWalking=="Yes",1,0)
heart$PhysicalActivity<-ifelse(heart$PhysicalActivity=="Yes",1,0)
heart$KidneyDisease<-ifelse(heart$KidneyDisease=="Yes",1,0)
heart$SkinCancer<-ifelse(heart$SkinCancer=="Yes",1,0)
heart$Diabetic<-ifelse(heart$Diabetic=="Yes",1,0)
heart$Asthma<-ifelse(heart$Asthma=="Yes",1,0)

Gender and Lifestyle Factors

According to Noel Bairey Merz, MD, director of the Barbra Streisand Women’s Heart Center in the Smidt Heart Institute, women experience heart disease more severely than men. Meaning that even though men are more likely to suffer heart attacks, women are more likely to die from a heart attack than a man.

We started off by trying to see which gender tends to have more heart disease based on symptoms reported just before diagnosis. From the table below, we can see that we have a lot more data for people without heart disease than with. However, even in the percentage table for those who do have heart disease, we see that there are more men with heart disease (about 5%) compared to women (3%). This is in line with our basic theory that men have more heart disease. However, we are then forced to ask why do men have more heart disease in our sample? Are they doing more or less of a certain activity than their female counterparts? Discussions of the relevant variables are provided below.

Proportion of Heart Disease Among the Genders

contable = table(heart$HeartDisease, heart$Sex)
xkabledply(contable, title="Contingency Table for heart Disease and Sex")
Contingency Table for heart Disease and Sex
Female Male
0 156571 135851
1 11234 16139
addmargins(contable)
##      
##       Female   Male    Sum
##   0   156571 135851 292422
##   1    11234  16139  27373
##   Sum 167805 151990 319795
prop <-prop.table(contable)
percent <- round(prop*100, 2)

print ("Contingency Table for Heart Disease and Sex in Percentage")
## [1] "Contingency Table for Heart Disease and Sex in Percentage"
print (percent)
##    
##     Female  Male
##   0  48.96 42.48
##   1   3.51  5.05
barplot(with(heart, table(HeartDisease, Sex)), main="Heart Disease Among Male & Female", beside=T, col=3:2)

In our exploratory analysis, we compared the different lifestyle variables between men and women and then further compared those same variables between those with and without heart disease.

BMI: Is there a difference between the mean BMI of men and woman in the overall population?

hist(heart$BMI)

In the histogram, we can see that our data is approximately normal and given the Central Limit Theorem, if n>30, then we can assume normality for the sake of statistical tests.

After sub-setting the data between male and female, we conducted a Welch’s t-test on the BMIs of both population subsets.

# Subset Male & Female with and without Heart Disease
female <- subset(heart, Sex== "Female")
male  <- subset (heart, Sex == "Male")

# T-test of BMI between Genders 
t.test(female$BMI, male$BMI)
## 
##  Welch Two Sample t-test
## 
## data:  female$BMI and male$BMI
## t = -15, df = 3e+05, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.387 -0.299
## sample estimates:
## mean of x mean of y 
##      28.2      28.5

The p-value is less than the 5% significance level, allowing us to reject the null hypothesis and conclude that the overall average BMI is different between the male and female populations. From the mean values, we see that men have a mean BMI of 28.50 while women have a BMI of 28.16. A difference of 0.34.

Next, we divided our data frame between men and women who do have heart disease. Once again assuming normality due to the large sample size, we conducted another Welch’s T-test to look at the mean BMI difference.

BMI: Is there a difference in average BMI between the sexes with heart disease?

# Subsetting genders based on respondents that have heart disease 
female_HD<- subset(female, HeartDisease==1)
male_HD<- subset(male, HeartDisease== 1)

#T-test
t.test(female_HD$BMI, male_HD$BMI)
## 
##  Welch Two Sample t-test
## 
## data:  female_HD$BMI and male_HD$BMI
## t = -0.5, df = 20988, p-value = 0.6
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.208  0.121
## sample estimates:
## mean of x mean of y 
##      29.4      29.4

From the T-test output above, we see that the p-value is 0.60, which is greater than 0.05. This means that we fail to reject the null and conclude that there is no significant difference in the average BMI between men and women who do have heart disease.

Physical Health: How many times in the last 30 days have you felt physically ill?

Earlier, we saw that respondents, who have heart disease, reported feeling physically unwell about 8 days per month, while healthy people felt unwell for only 3. The box plot below shows us that women with heart disease generally report felling unwell more often than men.

ggplot(heart, aes(x = factor(HeartDisease), y = PhysicalHealth))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
    ggtitle('How many times in the last 30 days have you felt physically ill?') +
    xlab('Heart Disease') + 
    ylab('Physical Health')+
    scale_fill_brewer(palette = 'Set2', name = 'Gender')

t.test(male_HD$PhysicalHealth, female_HD$PhysicalHealth)
## 
##  Welch Two Sample t-test
## 
## data:  male_HD$PhysicalHealth and female_HD$PhysicalHealth
## t = -16, df = 23059, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.53 -1.97
## sample estimates:
## mean of x mean of y 
##      6.88      9.14

The t-test also confirms what the box plot is showing. Since the p-value is less than 5%, that there is a difference in the number of days that men and women feel unwell. On average, women reported feeling sick about 9 days while men reported just over 6 days.

Mental Health: How many times in the last 30 days have you felt mentally ill?

ggplot(heart, aes(x = factor(HeartDisease), y = MentalHealth))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
     ggtitle('How many times in the last 30 days have you felt mentally ill?') +
    xlab('Heart Disease') + 
    ylab('Mental Health')+
    scale_fill_brewer(palette = 'Set3', name = 'Sex')

t.test  (male_HD$MentalHealth,
female_HD$MentalHealth)
## 
##  Welch Two Sample t-test
## 
## data:  male_HD$MentalHealth and female_HD$MentalHealth
## t = -22, df = 21050, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.79 -2.34
## sample estimates:
## mean of x mean of y 
##      3.59      6.16

The boxplot shows that women tend to report feeling mentally unwell more often than men.Since the p-value is less than 5%, indicating that there is a difference in the number of days that men and women feel mentally unwell. On average, women reported feeling mentally unwell about twice as many days as men. The cyclical nature of mental health and exercise has been covered extensively (Mikkelsen 2017; Raglin 1990); which assert that individuals struggling with mental health issues often are unable to complete physically demanding tasks, further making matters worse. This is because exercise has been shown to produce beneficial hormones that are associated with improvements in mental health.

Sleep Time: On average,how many hours of sleep do you get in a 24-hour period?

ggplot(heart, aes(x = factor(HeartDisease), y = SleepTime))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
    ggtitle(' On average,how many hours of sleep do you get in a 24-hour period?') +
    xlab('Heart Disease') + 
    ylab('Sleep Time')+ 
    scale_fill_brewer(palette = 'Set', name = 'Gender')

t.test  (male_HD$SleepTime,
          female_HD$SleepTime)
## 
##  Welch Two Sample t-test
## 
## data:  male_HD$SleepTime and female_HD$SleepTime
## t = 4, df = 22378, p-value = 2e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.0515 0.1389
## sample estimates:
## mean of x mean of y 
##      7.18      7.08

Men and women have statistically different sleep times. Men on average get about 35 more minutes of sleep a week than women. This may not sound like much but sleep effects are known to be cumulative. A study conducted by Robbins et al. in 2021 “examined a group of people who got the recommended seven to eight hours of sleep a night to those who got less (even just 30 minutes less), and exposed them to a norovirus”. The study found that curtailing sleep below the recommended seven hours increases the risk for viral infection “nearly four times higher among those who are sleeping poorly or getting insufficient sleep, compared to those who are getting good rest” (Robbins et al, 2021).

This lack of sleep could explain why we see that women in our data on average report feeling sick more often and could also be a contributing factors is why women are more likely to die from heart attacks than men (Noel Bairey Merz).

#Smoking:Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]

#Smoking 
ggplot(heart, aes(x = factor(HeartDisease), y = Smoking))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
    ggtitle(' Have you smoked at least 100 cigarettes in your entire life?') +
    xlab('Heart Disease') + 
    ylab('Smoking')+ 
    scale_fill_brewer(palette = 'Set2', name = 'Gender')

t.test  ( male_HD$Smoking,
        female_HD$Smoking)

#Drinking: How many drinks have you had this week?

ggplot(heart, aes(x = factor(HeartDisease), y = Smoking))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
    ggtitle('How many drinks have you had this week?') +
    xlab('Heart Disease') + 
    ylab('Alcohol Consumption')+ 
    scale_fill_brewer(palette = 'Set3', name = 'Gender')

t.test  ( male_HD$AlcoholDrinking,
        female_HD$AlcoholDrinking)

Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week

Age, Race, General Health

HD <- subset(heart, HeartDisease==1)

freq(HD$AgeCategory)

##            var frequency percentage cumulative_perc
## 1  80 or older      5449      19.91            19.9
## 2        70-74      4847      17.71            37.6
## 3        65-69      4101      14.98            52.6
## 4        75-79      4049      14.79            67.4
## 5        60-64      3327      12.15            79.5
## 6        55-59      2202       8.04            87.6
## 7        50-54      1383       5.05            92.6
## 8        45-49       744       2.72            95.3
## 9        40-44       486       1.78            97.1
## 10       35-39       296       1.08            98.2
## 11       30-34       226       0.83            99.0
## 12       25-29       133       0.49            99.5
## 13       18-24       130       0.47           100.0
freq(HD$Race)

##                              var frequency percentage cumulative_perc
## 1                          White     22507      82.22            82.2
## 2                          Black      1729       6.32            88.5
## 3                       Hispanic      1443       5.27            93.8
## 4                          Other       886       3.24            97.0
## 5 American Indian/Alaskan Native       542       1.98            99.0
## 6                          Asian       266       0.97           100.0
freq(HD$GenHealth)

##         var frequency percentage cumulative_perc
## 1      Good      9558      34.92            34.9
## 2      Fair      7084      25.88            60.8
## 3 Very good      5381      19.66            80.5
## 4      Poor      3850      14.06            94.5
## 5 Excellent      1500       5.48           100.0

When looking at the population make-up of respondents who do have heart disease, we see that a higher percentage of respondents are 60 and older, with 82% being of Caucasian descent. At the time of the interview they also said they felt their overall health is good. This may be a limiting factor for future research since, once again, the sample does not have many data points from non-Caucasian individuals. We also see that most of our respondents with heart disease are older above 60 years of age, this is in-line with theory.

Even though a logit regression seems more appropriate to estimate whether or not an individual gets heart disease, we can still use a linear regression in our exploratory stages to find the best fitting model.

Liner Regression on Sex

reg_1 <- lm(HeartDisease~Sex+BMI, data=heart)
BIC(reg_1)
## [1] 90503
summary (reg_1)
## 
## Call:
## lm(formula = HeartDisease ~ Sex + BMI, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2521 -0.1024 -0.0805 -0.0587  0.9677 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.03e-03   2.29e-03     2.2    0.028 *  
## SexMale     3.85e-02   9.87e-04    39.0   <2e-16 ***
## BMI         2.20e-03   7.76e-05    28.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.279 on 319792 degrees of freedom
## Multiple R-squared:  0.0074, Adjusted R-squared:  0.00739 
## F-statistic: 1.19e+03 on 2 and 319792 DF,  p-value: <2e-16
reg_2 <- lm(HeartDisease~Sex+BMI+ PhysicalHealth+MentalHealth + SleepTime, data=heart)
BIC(reg_2)
## [1] 80963
summary (reg_2)
## 
## Call:
## lm(formula = HeartDisease ~ Sex + BMI + PhysicalHealth + MentalHealth + 
##     SleepTime, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3585 -0.0918 -0.0726 -0.0425  1.0001 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.18e-02   3.41e-03   -6.40  1.5e-10 ***
## SexMale         4.22e-02   9.78e-04   43.10  < 2e-16 ***
## BMI             1.43e-03   7.70e-05   18.55  < 2e-16 ***
## PhysicalHealth  6.18e-03   6.41e-05   96.37  < 2e-16 ***
## MentalHealth   -4.95e-04   6.44e-05   -7.69  1.5e-14 ***
## SleepTime       3.95e-03   3.41e-04   11.58  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.275 on 319789 degrees of freedom
## Multiple R-squared:  0.0367, Adjusted R-squared:  0.0367 
## F-statistic: 2.44e+03 on 5 and 319789 DF,  p-value: <2e-16
reg_3 <- lm(HeartDisease~Sex+BMI+ PhysicalHealth+MentalHealth + Smoking + AlcoholDrinking+ SleepTime + AgeCategory + Race, data=heart)
BIC(reg_3)
## [1] 60289
summary (reg_3)
## 
## Call:
## lm(formula = HeartDisease ~ Sex + BMI + PhysicalHealth + MentalHealth + 
##     Smoking + AlcoholDrinking + SleepTime + AgeCategory + Race, 
##     data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4570 -0.1220 -0.0529 -0.0020  1.0648 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -6.16e-02   5.25e-03  -11.74  < 2e-16 ***
## SexMale                 4.88e-02   9.57e-04   50.95  < 2e-16 ***
## BMI                     2.05e-03   7.62e-05   26.92  < 2e-16 ***
## PhysicalHealth          4.61e-03   6.32e-05   72.95  < 2e-16 ***
## MentalHealth            9.31e-04   6.39e-05   14.56  < 2e-16 ***
## Smoking                 3.50e-02   9.92e-04   35.27  < 2e-16 ***
## AlcoholDrinking        -2.25e-02   1.89e-03  -11.93  < 2e-16 ***
## SleepTime              -1.39e-03   3.33e-04   -4.16  3.1e-05 ***
## AgeCategory25-29       -6.29e-03   2.75e-03   -2.29   0.0222 *  
## AgeCategory30-34       -6.45e-03   2.69e-03   -2.40   0.0166 *  
## AgeCategory35-39       -5.83e-03   2.64e-03   -2.21   0.0269 *  
## AgeCategory40-44        1.14e-03   2.63e-03    0.43   0.6643    
## AgeCategory45-49        1.10e-02   2.61e-03    4.20  2.6e-05 ***
## AgeCategory50-54        2.92e-02   2.52e-03   11.57  < 2e-16 ***
## AgeCategory55-59        4.63e-02   2.44e-03   18.96  < 2e-16 ***
## AgeCategory60-64        6.98e-02   2.39e-03   29.17  < 2e-16 ***
## AgeCategory65-69        9.48e-02   2.39e-03   39.60  < 2e-16 ***
## AgeCategory70-74        1.32e-01   2.44e-03   54.00  < 2e-16 ***
## AgeCategory75-79        1.65e-01   2.65e-03   62.18  < 2e-16 ***
## AgeCategory80 or older  2.08e-01   2.58e-03   80.59  < 2e-16 ***
## RaceAsian              -2.45e-02   4.75e-03   -5.15  2.5e-07 ***
## RaceBlack              -2.01e-02   4.09e-03   -4.91  9.1e-07 ***
## RaceHispanic           -1.75e-02   4.03e-03   -4.35  1.4e-05 ***
## RaceOther              -1.30e-02   4.48e-03   -2.90   0.0037 ** 
## RaceWhite              -2.04e-02   3.73e-03   -5.48  4.3e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.266 on 319770 degrees of freedom
## Multiple R-squared:  0.0977, Adjusted R-squared:  0.0976 
## F-statistic: 1.44e+03 on 24 and 319770 DF,  p-value: <2e-16

With each new addition of a variable, we see that the adjusted R-squared is increasing indicating that the model fit is improving. However, the fit statistic is still quite low. Note that the 40-44 years age category has insignificant p-values. Since model 3 has the highest adjusted r-squared and the lowest BIC values, it will be our preferred model.

VIF Table to select variables

ezids::xkablevif(reg_3, wide=FALSE)
VIFs of the model
V1
AgeCategory25-29 1.03
AgeCategory30-34 1.06
AgeCategory35-39 1.14
AgeCategory40-44 1.17
AgeCategory45-49 1.08
AgeCategory50-54 1.02
AgeCategory55-59 1.04
AgeCategory60-64 1.72
AgeCategory65-69 1.81
AgeCategory70-74 1.89
AgeCategory75-79 1.92
AgeCategory80 or older 1.95
AlcoholDrinking 2.10
BMI 2.28
MentalHealth 2.45
PhysicalHealth 2.47
RaceAsian 2.37
RaceBlack 1.99
RaceHispanic 2.10
RaceOther 2.51
RaceWhite 5.04
SexMale 5.77
SleepTime 3.00
Smoking 11.28

From, the VIF table, the VIF value for smoking is 11.28, this is undesirable and thus will be dropped. Therefore our preferred model is as follows:

HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory

We will train and test the model to see the cost analysis of our predictions.

Logistic Regression

In order to address the unbalanced classification, we subsetted our original data frame to create a new one (called total) that had equal proportions of the Heart Disease category. The new dataset, called total, has over 54000 observation that have been split into train and test sets.

Model Training

# Creating a new dataframe (called total) with equal proportions of Heart Disease values 
newdata1<- subset(heart, HeartDisease==1)
newdata1_1<-head(newdata1,27373)
nrow(newdata1_1)
newdata0<- subset(heart, HeartDisease==0)
newdata0_1<-head(newdata0,27373)
nrow(newdata0_1)
total <- rbind(newdata1_1, newdata0_1)
#nrow(total)
summary(total)
#freq(total$HeartDisease)


# Convert to factor variables 
total$HeartDisease<-factor(total$HeartDisease)
total$Sex<-factor(total$Sex)

# Split into 80-20 train-test sets

set.seed(1234) 

sample <- sample.int(n = nrow(total), size = floor(.80*nrow(total)), replace = F)
train <- total[sample, ]
test  <- total[-sample, ]

# Logit Model Training
model<- glm(HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory, data= train, family='binomial')
#summary(model)

library(lemon)
knit_print.data.frame <- lemon_print

#Chi-squared test for model fit 
anova(model, test = 'Chisq')

# Exponential Coefficients  & Confidence Interval 
exp_coeff<- exp(cbind(OR = coef(model), confint(model)))
knitr::kable(exp_coeff,caption='Odds Ratio & Confidence Interval of Model')

Looking at the anova chi-squared results, we can conclude that the logit model (HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory) used is a good fit model for predicting the probability of getting heart disease. All the coefficient p-values in the model are also significant, except for the 25-39 age range, which is significant at the 10% level.

According to the model, the odds for getting heart disease increases by 2.11 for a man compared to a woman. While an increase in BMI increases the odds for heart disease by 1.03. Mental health, drinking, sleep and Physical health also affect your chances of getting heart disease but are not the most salient risk factors; they could be used as early indicators. As expected, the odds of getting heart disease for age increases as an individual gets older.

Also note, the confidence intervals for all the variables are narrow, indicating that they are good predictors of heart disease. However, the confidence intervals for age gets wider as we increase an individuals age. Although the uncertainty is greater at higher age ranges, there still may be enough precision to inform a patient about preventative measures, when considering their lifestyle choices.

Predicting and Assessing the Model

library(ggthemes)

# Prediction
train$prediction <- predict( model, newdata = train, type = "response" )
test$prediction  <- predict( model, newdata = test , type = "response" )

# distribution of the prediction score grouped by known outcome
ggplot( train, aes( prediction, color = as.factor(HeartDisease) ) ) + 
geom_density( size = 1 ) +
ggtitle( "Training Set's Predicted Score" ) + 
scale_color_economist( name = "data", labels = c( "No Heart Disease", "Yes Heart Disease" ) ) + 
theme_economist()

# Positive instance = 1 = Yes Heart Disease 
# Negative instance = 0 = No Heart Disease

After obtaining the predicted values that an individual will have heart disease on both the training and testing sets, a double density plot was drawn to show those predictions. Since we split our data evenly, the negative instances are skewed to the left while the positive instances are skewed to the right.

ROC-AUC

# user-defined different cost for false negative and false positive

source('unbalanced_functions.R')

cost.fp = 100
cost.fn = 200

roc_info <- ROCInfo( data = test, 
                     predict = "prediction", 
                               actual = "HeartDisease", 
                               cost.fp = 100,
                               cost.fn = 200 )
grid.draw(roc_info$plot)

Cots for false negatives are higher than false positives.

Confusion Matrix

#loadPkg("regclass")
xkabledply( confusion_matrix(model), title = "Confusion Matrix from Logit Model" )
#unloadPkg("regclass")

# 
cm_info <- ConfusionMatrixInfo( data = test, 
                                predict = "prediction", 
                                actual  = 'HeartDisease', 
                                cutoff  = .30 )
#ggthemr("flat")
plot(cm_info$plot)

End of Section

Heart Disease & Secondary Illnesses

In this part we are Identifying whether Heart Disease and secondary illnesses have correlation or not(are they dependent or independent?).

#Exploratory data analysis

library(ggplot2)
ggplot(heart, aes(HeartDisease)) +
  geom_bar(color="black",fill="aquamarine4")

patient <- subset(heart,HeartDisease==1)
head(patient)
##    HeartDisease  BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
## 6             1 28.9       1               0      0              6            0
## 11            1 34.3       1               0      0             30            0
## 36            1 33.0       1               0      1             10            0
## 43            1 25.1       0               0      0              0            0
## 44            1 30.2       1               0      0              6            2
## 46            1 20.4       1               0      0              3            0
##    DiffWalking    Sex AgeCategory  Race Diabetic PhysicalActivity GenHealth
## 6            1 Female       75-79 Black        0                0      Fair
## 11           1   Male       60-64 White        1                0      Poor
## 36           1   Male       75-79 White        1                1      Poor
## 43           1 Female 80 or older White        1                0      Good
## 44           1 Female       75-79 White        1                1      Fair
## 46           0 Female       70-74 White        0                0      Poor
##    SleepTime Asthma KidneyDisease SkinCancer
## 6         12      0             0          0
## 11        15      1             0          0
## 36         4      0             0          1
## 43         7      0             0          1
## 44         8      0             1          0
## 46        10      0             0          0
na_patient <- subset(heart,HeartDisease==0)

#barplot of HeartDisease vs Diabetic

library(ggplot2)
ggplot(patient, aes(x=Diabetic)) + geom_bar(color="black",fill="darkblue")

dia_pa<-subset(patient,Diabetic==1)
nrow(dia_pa)
## [1] 9746
ggplot(na_patient, aes(x=Diabetic)) + geom_bar(color="black",fill="darkblue")+ scale_y_continuous(labels = scales::comma)

dia_pa1<-subset(na_patient,Diabetic==1)
nrow(dia_pa1)
## [1] 37837

#barplot of HeartDisease vs Asthma

ggplot(patient, aes(x=Asthma)) + geom_bar(color="black",fill="cadetblue1")

ast_pa<-subset(patient,Asthma==1)
nrow(ast_pa)
## [1] 4933
ggplot(na_patient, aes(x=Asthma)) + geom_bar(color="black",fill="cadetblue1" )+ scale_y_continuous(labels = scales::comma)

ast_pa1<-subset(na_patient,Asthma==1)
nrow(ast_pa1)
## [1] 37939

#barplot of HeartDisease vs KidneyDisease

ggplot(patient, aes(x=KidneyDisease)) + geom_bar(color="black",fill="darkcyan")

kid_pa<-subset(patient,KidneyDisease==1)
nrow(kid_pa)
## [1] 3455
ggplot(na_patient, aes(x=KidneyDisease)) + geom_bar(color="black",fill="darkcyan")+ scale_y_continuous(labels =scales::comma)

kid_pa1<-subset(na_patient,KidneyDisease==1)
nrow(kid_pa1)
## [1] 8324

#barplot of HeartDisease vs SkinCancer

ggplot(patient, aes(x=SkinCancer)) + geom_bar(color="black",fill="darkgray")

skin_pa<-subset(patient,SkinCancer==1)
nrow(skin_pa)
## [1] 4980
ggplot(na_patient, aes(x=SkinCancer)) + geom_bar(color="black",fill="darkgray")+ scale_y_continuous(labels = scales::comma)

skin_pa1<-subset(na_patient,SkinCancer==1)
nrow(skin_pa1)
## [1] 24839
# We are seperating our desired variables to test them in an easier way.
diseases <- data.frame (heart$HeartDisease,
                        heart$Stroke,
                        heart$Diabetic,
                        heart$Asthma,
                        heart$KidneyDisease,
                        heart$SkinCancer)

# Rename columns

names(diseases) <- c ('heart_disease',
                      'stroke',
                      'diabetic',
                      'asthma',
                      'kidney_disease',
                      'skin_cancer')

Corellation Matrix and Correlation Plot Between Diseases

cor_matrix<-cor(diseases)
cor_matrix
model.matrix(~0+., data=diseases) %>% 
     cor(use="pairwise.complete.obs") %>% 
    ggcorrplot(show.diag = F, type="upper", lab=TRUE, lab_size=2)

From the above correlation test we can see that overall Correlations are not particularly strong. Heartdisease has more correlation with stroke than any other secondary illnesses(Which is not a surprise generally). Diabetes is correlated with heart disease and Kidney disease, Less with Stroke. Skin cancer shows negligible correlation with all other diseases. overall, we can see that only stroke, diabetes and kidney disease have considerable correlation with heart disease. Not only from our correlation test, but as per center for disease control and prevention, Common heart disorders can increase your risk for stroke. For example, coronary artery disease increases your risk for stroke, because plaque builds up in the arteries and blocks the flow of oxygen-rich blood to the brain. Also diabetes increases your risk for stroke(Stroke vs Diabetes in our correlation plot ).Diabetes causes sugars to build up in the blood and prevent oxygen and nutrients from getting to the various parts of your body, including your brain. High blood pressure is also common in people with diabetes. High blood pressure is the leading cause of stroke and is the main cause for increased risk of stroke among people with diabetes (CDC Stroke Fact sheet). (https://www.cdc.gov/stroke/conditions.htm#:~:text=Common%20heart%20disorders%20can%20increase,rich%20blood%20to%20the%20brain.)

Determining Independence Between Diseases

After finding the correlation between variables we are now performing a chi-suare test of independence to find out whether the varibles are dependent or independent with each other. The Chi-Square test of independence is used to determine if there is a significant relationship between two nominal (categorical) variables(All the variables we have here are already categorical). In a more general sense, it tests to see whether distributions of categorical variables differ from each another.The frequency of each category for one nominal variable is compared across the categories of the second nominal variable. The data can be displayed in a contingency table where each row represents a category for one variable and each column represents a category for the other variable. The null hypothesis for this test is that there is no relationship between primary disease heartDisease and other secondary illnesses. The alternative hypothesis is that there is a relationship between heartDisease and secondary illnesses.

output: the title of the test, which variables have been used, the test statistic, the degrees of freedom and the p-value of the test.

hea_as=table(heart$HeartDisease,heart$Asthma)
chi_1=chisq.test(hea_as)
chi_1

hea_dia=table(heart$HeartDisease,heart$Diabetic)
chi_2=chisq.test(hea_dia)
chi_2

hea_kid=table(heart$HeartDisease,heart$KidneyDisease)
chi_3=chisq.test(hea_kid)
chi_3

hea_skin=table(heart$HeartDisease,heart$SkinCancer)
chi_4=chisq.test(hea_skin)
chi_4

hea_stroke=table(heart$HeartDisease,heart$Stroke)
chi_5=chisq.test(hea_stroke)
chi_5

From the output and from test$p.value we see that the p-value is less than the significance level of 5%. Like any other statistical test, if the p-value is less than the significance level, we can reject the null hypothesis.Every secondary illness (Diabetic,Asthma, Kidney disease and skin cancer) has a relationship with heart Disease.

Logistic Regression Model

library(tidyverse)
library(caret)
heartsub <- data.frame(heart)
heartsub <- heartsub[, c("HeartDisease","Stroke","Diabetic","Asthma","KidneyDisease","SkinCancer")]

newdata1<- subset(heartsub, HeartDisease==1)
newdata1_1<-head(newdata1,20000)

newdata0<- subset(heartsub, HeartDisease==0)
newdata0_1<-head(newdata0,20000)

total <- rbind(newdata1_1, newdata0_1)

set.seed(100)
heart_train_rows <- sample(1:nrow(total), 0.7*nrow(total))
heart_train <- total[heart_train_rows, ]
heart_test <- total[-heart_train_rows, ]
mod1 <- glm(HeartDisease ~ Stroke + Diabetic + Asthma + KidneyDisease 
              + SkinCancer,
              data=heart_train,family = "binomial")
summary(mod1)
xkabledply(mod1, title = paste("Logistic Regression :", format(formula(mod1)) ))
expcoeff = exp(coef(mod1))
# expcoeff
xkabledply( as.table(expcoeff), title = "Exponential of coefficients in Logit Reg" )
Exponential of coefficients in Logit Reg
x
(Intercept) 0.56
Stroke 5.31
Diabetic 2.97
Asthma 1.17
KidneyDisease 2.76
SkinCancer 1.91

Confusion matrix

#loadPkg("regclass")
# confusion_matrix
xkabledply( confusion_matrix(mod1), title = "Confusion matrix from Logit Model" )
Confusion matrix from Logit Model
Predicted 0 Predicted 1 Total
Actual 0 10424 3643 14067
Actual 1 5821 8112 13933
Total 16245 11755 28000
#unloadPkg("regclass")

Accuracy - It determines the overall predicted accuracy of the model. It is calculated as Accuracy = (True Positives + True Negatives)/(True Positives + True Negatives + False Positives + False Negatives)

Accuracy-66.2

loadPkg("pROC")
library(pROC)
library(ROCR)
test_prob <- predict(mod1, newdata = heart_test, type = "response")
train_prob <-predict(mod1, newdata = heart_train, type = "response")
test_roc <- roc(heart_test$HeartDisease ~ test_prob, plot = TRUE, print.auc = TRUE)

fittedheartTrain=ifelse(train_prob<0.5,0,1 )
ConfMatTrain=table(heart_train$HeartDisease,fittedheartTrain)
TrainAccuracy=(ConfMatTrain[1,1]+ConfMatTrain[2,2])/sum(ConfMatTrain)
TrainTPR=72/(72+11)
TrainFPR=10/(10+79)

print(cbind(TrainAccuracy, TrainTPR, TrainFPR))

fittedheartTest=ifelse(test_prob<0.5,0,1 )
ConfMatTest=table(heart_test$HeartDisease,fittedheartTest)
TestAccuracy=(ConfMatTest[1,1]+ConfMatTest[2,2])/sum(ConfMatTest)
TestTPR=45/(45+10)
TestFPR=1-52/(52+7) ## expressed as 1-true negative rate

print(cbind(TestAccuracy, TestTPR, TestFPR))

#Decision tree model

set.seed(1)
library(rpart)
library(rpart.plot)
tree<-rpart(HeartDisease~.,method = "class",data =heart_train)
rpart.plot(tree)

probabilities <- predict(tree, heart_test)
tree_pred <- ifelse(probabilities[,2] > 0.5,1,0)
table(Actual = heart_test$HeartDisease,Predicted = tree_pred )

Accuarcy-66.7

Smokers Vs Alcoholic Drinkers

First of all does smoking and drinking cause heart attack or heart disease?

Yes, it is actually one among the reason that can cause heart disease in young people and majority of cases are seen in women according to a most of the publishes articles and researchers.

To determine if there is a link between heart disease and drug use, researchers investigated and discovered that recreational use of nicotine, cannabis, alcohol, and illegal substances such as amphetamine and cocaine may be linked to prematurely clogged arteries.Premature heart disease can arise before the age of 65 in women and 55 in males, and it can be difficult and dangerous.

According to statistics from the countrywide veterans affairs healthcare and veterans with premaTure AtheroscLerosis (VITAL) registry (2014-2015). There are 7716 patients who are having extremely premature heart disease , and they discovered this by comparing them to those who do not have premature heart disease.

According to the study and analysis, patients with early heart disease are more likely to smoke (63 percent versus 41 percent) and drink alcohol (32 percent vs 15 percent )

barplot(table(heart$Smoking, heart$HeartDisease), legend= rownames(table(heart$Smoking, heart$HeartDisease)), col =(c("orange", "blue")), xlab="heartdisease", ylab="smoking",main = "Barplot for Heartdisease and Smoking")

0 represents No 1 represents Yes Based on this barplot we can see that people who have heartdisease and smoke have more chances of getting heart disease than people who don’t have heartdiseasse but have smoking habit.

barplot(table(heart$AlcoholDrinking, heart$HeartDisease), legend= rownames(table(heart$AlcoholDrinking, heart$HeartDisease)), col =(c("red", "darkblue")), xlab="Heartdisease", ylab="Alcoholdrinking", main = "Barplot for Heartdisease and Alcoholdrinking")

0 represents No 1 represents Yes Based on this barplot we can see that people who have heartdisease and drink have chances but are very less when compared to people who don’t have heart disease and drink.

barplot(table(heart$HeartDisease, heart$Diabetic), legend= rownames(table(heart$HeartDisease, heart$Diabetic)), col =(c("green", "brown")),xlab="Diabetic",ylab="HeartDisease", main = "Barplot for Heartdisease and Diabetic")

lm1 <- lm(HeartDisease ~ AlcoholDrinking + Smoking , data = heart)
xkabledply(lm1, title = paste("Model :", format(formula(lm1)) ) )
xkablevif(lm1)
summary(lm1)

Based on this model we can see that the p-values are small indicating that there is a correlation between alcoholdrinking, smoking and heartdisease. For example for every one percent increase in smoking there is 0.064 % increase in heart disease. When we look at t-value for smoking we can see that 63.8% which indicates that smoking is dependent on the heart disease than in Alcohol Drinking.

lm2 <- lm(HeartDisease ~ Smoking + AlcoholDrinking + Diabetic + Smoking : Diabetic + Diabetic: AlcoholDrinking,data = heart)
xkabledply(lm2, title = paste("Model :", format(formula(lm2)) ) )
xkablevif(lm2)
summary(lm2)

In this model we can see that for the interaction term for alcoholdrinking and diabetic there is high p-value which indicates that there is no correlation between these varaibles and heart disease. For people who have diabetics and smoke have high chance of getting heart disease based on the p-value. Based on the t-value we can see that 60.06% for diabetics, which says that diabetics is correlated to heart disease.

library(dyn)
anovaRes = aov(HeartDisease ~ AlcoholDrinking, data=heart)
xkabledply(anovaRes, title = "ANOVA result summary")

anovaRes1 = aov(HeartDisease ~ Smoking, data=heart)
xkabledply(anovaRes1, title = "ANOVA result summary")
cor_ad_smoke <- cor(heart$AlcoholDrinking, heart$Smoking)
summary(cor_ad_smoke)
cor_as_hd <- cor(heart$AlcoholDrinking, heart$HeartDisease)
summary(cor_as_hd)
cor_hd_smoke <- cor(heart$HeartDisease, heart$Smoking)
summary(cor_hd_smoke)
contable  <- table(heart$HeartDisease, heart$Smoking)
summary(contable ) 
prop <-prop.table(contable)
percent <- round(prop*100, 2)
xkabledply(percent, title="Contingency Table for heart Disease and Smoking in Percentage")

0 represents No 1 represents yes So, in this contingency table we can see that we got 5% for people who have heartdisease and who smoke which indicates that smoke is more correlated to heartdisease.

contable2 <- table(heart$HeartDisease, heart$AlcoholDrinking)
summary(contable2) 
prop <-prop.table(contable2)
percent <- round(prop*100, 2)
xkabledply(percent, title="Contingency Table for heart Disease and AlcoholDrinking in Percentage")

0 represents No 1 represents yes So, in this contingency table we can see that we got 6.45% for people who don’t have heartdisease and who drink which indicates that drink is not that more correlated to heartdisease.

chitest1 = chisq.test(contable)
chitest1
chitest2 = chisq.test(contable2)
chitest2

Based on the Chi-square test we can say that the variables are dependent on the heartdisease.

model_lr <- glm(HeartDisease ~ Smoking + AlcoholDrinking , data = train , family = "binomial")
summary(model_lr)
xkabledply(model_lr, title = paste("Logistic Regression :", format(formula(model_lr))))

For this model we can see that the p-values are very small indicating that smoking and alcoholdrinking is related to heartdisease.

anova(model_lr, test="Chisq")
fitted.results <- predict(model_lr,newdata = test, type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$HeartDisease)
print(paste('Accuracy',1-misClasificError))

We have got an accuaracy of 0.589 which indicates that the model is pretty decent.

expcoeff = exp(coef(model_lr))
xkabledply( as.table(expcoeff), title = "Exponential of coefficients in Logit Reg" )
knitr::kable(expcoeff,caption='Odds Ratio & Confidence Interval of Model')

Using Confidence Intervals

xkabledply(confint(model_lr), title = "CIs using profiled log-likelihood" )
xkabledply(confint.default(model_lr), title = "CIs using standard errors" )
CIs using profiled log-likelihood
2.5 % 97.5 %
(Intercept) -0.383 -0.329
Smoking 0.755 0.832
AlcoholDrinking -0.796 -0.624
CIs using standard errors
2.5 % 97.5 %
(Intercept) -0.383 -0.329
Smoking 0.755 0.832
AlcoholDrinking -0.795 -0.624

Model evaluation

new_heart <- subset(heart,select = c(HeartDisease, Smoking, AlcoholDrinking, Diabetic))
cor_matrix<-cor(new_heart)
cor_matrix
model.matrix(~0+., data=new_heart) %>% 
     cor(use="pairwise.complete.obs") %>% 
    ggcorrplot(show.diag = F, type="upper", lab=TRUE, lab_size=3)

Based on the correlation matrix we can see that smoking is more correlated to heartdisease than alcoholdrinking.

#loadPkg("regclass")
xkabledply( confusion_matrix(model_lr), title = "Confusion matrix from Logit Model" )
Confusion matrix from Logit Model
Predicted 0 Predicted 1 Total
Actual 0 14062 7867 21929
Actual 1 9710 12157 21867
Total 23772 20024 43796

Receiver-Operator-Characteristic (ROC) curve and Area-Under-Curve (AUC)

loadPkg("pROC") 


library(ROCR)
p <- predict(model_lr, newdata=test, type="response")
pr <- prediction(p, test$HeartDisease)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

#area of curve is 0.593
model_lr2 <- glm(HeartDisease ~ Smoking + AlcoholDrinking + Diabetic + Diabetic:Smoking + Diabetic:AlcoholDrinking, data = train , family = "binomial")
summary(model_lr2)
xkabledply(model_lr2, title = paste("Logistic Regression :", format(formula(model_lr2))))

For this model we can see that the p-values are very small indicating that smoking and alcoholdrinking, diabetics and smoking:Diabetic is related to heartdisease.

fitted.results <- predict(model_lr2,newdata = test, type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$HeartDisease)
print(paste('Accuracy',1-misClasificError))
xkabledply(confint(model_lr2), title = "CIs using profiled log-likelihood" )
xkabledply(confint.default(model_lr2), title = "CIs using standard errors" )
CIs using profiled log-likelihood
2.5 % 97.5 %
(Intercept) -0.688 -0.6250
Smoking 0.742 0.8313
AlcoholDrinking -0.661 -0.4730
Diabetic 1.293 1.4308
Smoking:Diabetic -0.237 -0.0406
AlcoholDrinking:Diabetic -0.252 0.2761
CIs using standard errors
2.5 % 97.5 %
(Intercept) -0.688 -0.6249
Smoking 0.742 0.8312
AlcoholDrinking -0.661 -0.4726
Diabetic 1.293 1.4306
Smoking:Diabetic -0.237 -0.0407
AlcoholDrinking:Diabetic -0.254 0.2735

Model evaluation

Confusion matrix

#loadPkg("regclass")
xkabledply( confusion_matrix(model_lr2), title = "Confusion matrix from Logit Model" )
Confusion matrix from Logit Model
Predicted 0 Predicted 1 Total
Actual 0 12446 9483 21929
Actual 1 6446 15421 21867
Total 18892 24904 43796
loadPkg("pROC") 


library(ROCR)
p1 <- predict(model_lr2, newdata=test, type="response")
pr1 <- prediction(p1, test$HeartDisease)
prf1 <- performance(pr1, measure = "tpr", x.measure = "fpr")
plot(prf1)

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

What is the effect of Smoking and Drinking?

The interaction between the tobacco(smoking) and alcohol can lead to heart problems and it depends on some factors in which we can find out. Initially, It might depend upon quantity of dose the person is consuming it. And based on some research I found that consuming alcohol in range of 3 to 14 drinks per week is prone to lower risk of heart attack but there are other risks factors that lead to complex issues like blockage in artery and deficiency in blood flow or it can lead to heart failure.

When looked at the cause for smoking there is a high change of risk leading to heart problems. Some heart failures caused by smoking are stroke, bleeding from brain and congestion in chest.

Does people who Drink smoke?

Some researchers have found that among people who drink have high changes of smoking most of the times.

What is the relation between smoking and heart disease ?

So, basically people who tend to smoke have issues related to cardiovascular disease(CVD). CVD occurs when chemicals in cigarettes release then the blood becomes thick and starts forming clots inside veins and arteries. When the blood vessel are completely blocked then it can lead to heart attack and sudden death. Which can lead to death of the person. In the United States, there were reportedly 800,000 people who faced cardiovascular disease and was the mere cause of death in which 8million died of heart attack and other 7 million died of stroke.

When measured on how many cigarettes people consume leads to health issues there is 100 percent chance that people’s health will be damaged no matter if they consumed one or two per day. The fact that when people consume one to two cigarettes may show signs of CVD but gradually they might face problems. Another risk factor is that if people consume more than 5 cigarettes a day they may show signs of early CVD. Based on this we can imagine that the risk of CVD increases based on the number of cigarettes consumed per day.

How is smoking effecting?

When a person smoke the chemicals substances in the cigarette cause the cells of boodle vessels stolen and inflamed which leads to blood vessels becoming narrow.

What is the relation between Drinking and heart disease ?

When a person has a habit to drink than to indicate that his health is getting damaged then we have to understand what is the percentage of alcohol being consumed per day and we can weight it in grams/units. On an average the standard practice of people who drink consume 12 to 15grams off alcohol. Based on this practice there are most often low and high risks. Low risk or moderate risk has an average of 1 to 2 drinks per day and high risk can be measured on consuming more than 4 drinks per day.

For both the chi^2 tests we performed above, the p-value<0.05. So all the above rejects null hypothesis. Both drinking and smoking have a relationship with heart Disease.

Limitations

Telephone interviews- voluntary bias (old retired): Most of the surveys were made telephonic because of the covid and most of the people didn’t want to meet in person to risk their lives for the sake of surveys & Covid. So, they started doing telephonic surveys to find out whether the key indicators like high blood pressure, high cholesterol, and smoking among most racial and ethnic groups to know how they play a crucial role in Heart Disease. Most of the responses were rude and arrogant because they were frustrated with the covid and were not getting out of the home. Few people were very nice to surveys and were cool and they responded by stating to questions they have given relevant answers to and surveys have recorded their responses.

Liars - We are uncertain about the responses because we found that the responses which were recorded were different at times. We found that when we called them for the survey we got the responses as one and while we called them and said that we may be giving them some medical discounts we found the responses as other.

During the year 2021, the covid was at its peak and most people got affected by covid. Post covid, people’s immune systems got broken down and they became prone to several illnesses like diabetes, Asthma. The death rates with covid were very high and this led to the increase in death rates of heart disease because most of the people got infected with key indicators like high BMI and high cholesterol. Not only did the older people got affected by covid it was the younger people who got affected by covid their death rates were also high. From the noted survey, people tend to develop more secondary illnesses which in turn caused them primary disease which is Heart Disease, And got affected with higher death rates.

Conclusions:

To conclude, we found strong evidence from our EDA that people with higher BMI, high cholesterol, and age affect Heart Disease. Moreover, the variables were Sex, BMI, Physical Health, Mental Health, sleep time, Secondary Illnesses, alcohol, and Drinking.

Analysis of SMART Questions:

From the first question one Which gender tends to have more diseases based on BMI, smoking, drinking, and mental health in the year 2020? We concluded that On average, people with heart disease have a slightly higher BMI than those who don’t have heart disease. However, the biggest difference is the number of days that a respondent feels physically unwell. Respondents who have heart disease reported feeling physically unwell about 8 days per month, while healthy people felt unwell for only 3. Poor mental health in 30 days is slightly higher for people with heart disease than for people without. we compared the different lifestyle variables between men and women and then further compared those same variables between those with and without heart disease However, we are then forced to ask why do men have more heart disease in our sample? Are they doing more or less of a certain activity than their female counterparts? we compared the different lifestyle variables between men and women and then further compared those same variables between those with and without heart disease allowing us to reject the null hypothesis and conclude that the overall average BMI is different between the male and female populations. From the mean values, we see that men have a mean BMI of 28.50 while women have a BMI of 28.16. A difference of 0.34. From the T-test output, we see that the p-value is 0.604, which is greater than 0.05. This means that we fail to reject the null and conclude that there is no significant difference in the average BMI between men and women who do have heart disease. Since the p-value is less than 5%, there is a difference in the number of days that men and women feel unwell. On average, women reported feeling sick for about 9 days while men reported just over 6 days.

#From the second smart question Are people with heart diseases more prone to developing secondary illnesses’ or is it the other way around? we concluded that secondary illnesses are significantly dependent on heart diseases. We concluded this based on the Chi-squared test we observed that the p-value<0.05. We can reject the null hypothesis. Every secondary illness (Diabetic, Asthma, Kidney disease, and skin cancer) has a relationship with heart disease.

From the third smart question Are smokers or drinkers more prone to developing heart disease? Our analysis was based on the test not to conclude any kind of assumptions or theory we did some chi-squared tests For smoking and drinking the chi^2 tests we performed above, the p-value<0.05. So we can reject the null hypothesis. Both drinking and smoking have a relationship with heart disease.

From all the smart questions we can come to conclude that smokers, drinkers, and people with secondary illnesses are more prone to develop heart diseases and they are high chances of death rates. However, people with high BMI are unhealthy and most often tend to sick.

********************************************************END OF SECTION**************************

##GENERALHEALTH, PHYSICAL ACTIVITY, ETHNICITY AFFECTS HEART DISEASE

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Title Info

Introduction

Libraries

# Lets load the appropriate libraries
library(tidyverse)
library(dplyr)
library(funModeling)
library(ezids)
library(ggplot2)
library(ggcorrplot)
library(corrr)
library(corrplot)
library(ROCR)
library(grid)
library(broom)
library(caret)
library(tidyr)
library(dplyr)
library(scales)
library(ggplot2)
# ggthemr is not on CRAN. Need to install from console using 
devtools::install_github('cttobin/ggthemr')
# Might need to install devtools first if not already on system
# Follow prompt. Use 1 or 2 to install/update other dependencies. 
# Choose "no" to install those without needing to compile
# Nevermind, still failed. Not working for R 4.1.2 (202204 - EL)
# library(ggthemr)
library(ggthemr)
library(ggthemes)
library(gridExtra)
library(data.table)

Load the Data and Rename dataframe

 heart<-read.csv("heart_2020_cleaned.csv")

Lets take a look at our data

nrow (heart)

Descrpitive Statistics

status (heart)

From the table we can see that Physical and Mental health have a higher percentage of zeros. This means that just over 70% of the people in our data set are not physically active while about 60% of the respondents do not report having mental health issues.

We can also see that there are no N/A values in our data set. This is not surprising because our source data was cleaned before.

Now to see which variables are categorical and which are numerical

data_integrity(heart)

Description of every variable

describe (heart)

Converting HeartDIsease categorical into Numerical Variable

heart$HD<-ifelse(heart$HeartDisease=="Yes",1,0)

#Converting General Health as numericals

#plot_num(heart)
heart$GenHealth<-as.factor(heart$GenHealth)
heart$GeneralHealth<-sapply(heart$GenHealth,unclass)

#5-Very good,4-poor,3-Good,2-fair,1-excellent

freq(heart$GeneralHealth)

Frequency distributions of all our categorical variables

freq(heart)

Changing Diabetes bordeline and pregnancy categories to Yes/No

heart$Diabetic<- replace (heart$Diabetic, heart$Diabetic== "No, borderline diabetes" , "Yes")
heart$Diabetic[heart$Diabetic== "Yes (during pregnancy)"] <- "No"  

#Checking if diabetic has been changed

describe (heart$Diabetic)
freq(heart$Diabetic)

Changing all the categorical variables to numerical dummies

heart$Smoking<-ifelse(heart$Smoking=="Yes",1,0)
heart$AlcoholDrinking<-ifelse(heart$AlcoholDrinking=="Yes",1,0)
heart$Stroke<-ifelse(heart$Stroke=="Yes",1,0)
heart$DiffWalking<-ifelse(heart$DiffWalking=="Yes",1,0)
heart$PhysicalActivity<-ifelse(heart$PhysicalActivity=="Yes",1,0)
heart$KidneyDisease<-ifelse(heart$KidneyDisease=="Yes",1,0)
heart$SkinCancer<-ifelse(heart$SkinCancer=="Yes",1,0)
heart$Diabetic<-ifelse(heart$Diabetic=="Yes",1,0)
heart$Asthma<-ifelse(heart$Asthma=="Yes",1,0)

Changing race variable to numerical variable

heart$Race<-as.factor(heart$Race)
heart$Ethnicity<-sapply(heart$Race,unclass)

Converting white as 6, Hispanic as 5,Black as 4, Other as 3, Asian as 2, American Indian as 1

freq(heart$Ethnicity)

# Finding the mean of numerical variables grouped by heart disease
plot_num(heart)

heart %>%
  group_by(HD) %>%
  summarise_at(vars("GeneralHealth", 
                    "PhysicalActivity", 
                    "Ethnicity"), mean)
## # A tibble: 2 × 4
##      HD GeneralHealth PhysicalActivity Ethnicity
##   <dbl>         <dbl>            <dbl>     <dbl>
## 1     0          3.23            0.788      5.38
## 2     1          3.17            0.639      5.53
loadPkg("corrplot")
cor(heart$HD,heart$GeneralHealth)
hist(heart$GeneralHealth)

hist(heart$Ethnicity)

contable = table(heart$GeneralHealth, heart$HD)
xkabledply(contable, title="Contingency Table for Heart Disease and GeneralHealth")
addmargins(contable)

prop <-prop.table(contable)
percent <- round(prop*100, 2)

print ("Contingency Table for heart Disease and General Health in Percentage")
print (percent)
barplot(with(heart, table(GeneralHealth, HD)), main="heart Disease & GeneralHealth", beside=T, col=3:2)

# Subset Male & Female with and without Heart Disease
female <- subset(heart, Sex== "Female")
male  <- subset (heart, Sex == "Male")
# Subsetting genders based on respondents that have heart disease 
female_HD<- subset(female, HD==1)
male_HD<- subset(male, HD== 1)
#T-test
t.test(female_HD$GeneralHealth, male_HD$GeneralHealth)

The p-value is less than the 8% significance level, allowing us to reject the null hypothesis and conclude that the overall average GeneralHealth is different between the male and female populations. From the mean values, we see that men have a mean GeneralHealth of 3.12 while women have a GeneralHealth of 3.19. A difference of 0.17.

t.test(female_HD$PhysicalActivity, male_HD$PhysicalActivity)
boxplot(HD~GeneralHealth,data=heart)

plot(HD~PhysicalActivity,data=heart)

plot(HD~Ethnicity,data=heart)

plot(HD~DiffWalking,data=heart)

t.test(heart$GeneralHealth, heart$HD)
t.test(heart$PhysicalActivity, heart$HD)
t.test(heart$Ethnicity, heart$HD)
Genh_HD=table(heart$GeneralHealth,heart$HD)
chi_1=chisq.test(Genh_HD)
chi_1
HD_Eth=table(heart$PhysicalActivity,heart$HD)
chi_2=chisq.test(HD_Eth)
chi_2

For all the chi^2 tests we performed above, the p-value<0.05. So all the above rejects null hypothesis. Every GeneralHealth,PhysicalActivity has a relationship with heart Disease.

Regression <- lm(HD~GeneralHealth, data=heart)
BIC(Regression)
## [1] 92826
summary (Regression)
## 
## Call:
## lm(formula = HD ~ GeneralHealth, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.0901 -0.0881 -0.0860 -0.0820  0.9180 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.092091   0.001150   80.07   <2e-16 ***
## GeneralHealth -0.002017   0.000322   -6.26    4e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.28 on 319793 degrees of freedom
## Multiple R-squared:  0.000122,   Adjusted R-squared:  0.000119 
## F-statistic: 39.1 on 1 and 319793 DF,  p-value: 3.96e-10
Regression2 <- lm(HD~GeneralHealth+Ethnicity, data=heart)
BIC(Regression2)
summary (Regression2)
reg_3 <- lm(HD~GeneralHealth+PhysicalActivity+Ethnicity, data=heart)
BIC(reg_3)
summary (reg_3)

With each new addition of a variable, we see that the adjusted R-squared is increasing indicating that the model fit is improving. However, the fit statistic is still quite low.Since model 3 has the highest adjusted r-squared and the lowest BIC values, it will be our preferred model.

newdata1<- subset(heart, HD==1)
newdata1_1<-head(newdata1,27373)
nrow(newdata1_1)
newdata0<- subset(heart, HD==0)
newdata0_1<-head(newdata0,27373)
nrow(newdata0_1)
total <- rbind(newdata1_1, newdata0_1)
nrow(total)
summary(total)
freq(total)

#model 1

# Split into 80-20 train-test sets

set.seed(1234) 

sample <- sample.int(n = nrow(total), size = floor(.80*nrow(total)), replace = F)
train <- total[sample, ]
test  <- total[-sample, ]
# Logit Model Training
model<- glm(HD~ GeneralHealth, data= train, family="binomial")
summary(model)

Predicting and Assessing the Model

# Prediction
train$prediction <- predict( model, newdata = train, type = "response" )
test$prediction  <- predict( model, newdata = test , type = "response" )
# distribution of the prediction score grouped by known outcome
ggplot( train, aes( prediction, color = as.factor(HeartDisease) ) ) + 
geom_density( size = 2 ) +
ggtitle( "Training Set's Predicted Score" ) + 
scale_color_economist( name = "data", labels = c( "No Heart Disease", "Yes Heart Disease" ) ) + 
theme_economist()

# Positive instance = 1 = Yes Heart Disease 
# Negative instance = 0 = No Heart Disease
# user-defined different cost for false negative and false positive

roc_info1 <- ROCInfo( data = test, 
                     predict = "prediction", 
                               actual = "HD", 
                               cost.fp = 100,
                               cost.fn = 200 )
p <- recordPlot()
dev.off()
grid.draw(roc_info1$plot)
#accuracy_info <- AccuracyCutoffInfo( train = train, test = test, predict = "prediction", actual = 'HeartDisease' )
# define the theme for the next plot
#ggthemr("light")
#accuracy_info$plot
#loadPkg("regclass")
xkabledply( confusion_matrix(model), title = "Confusion matrix from Logit Model" )
#unloadPkg("regclass")

# 
cm_info <- ConfusionMatrixInfo( data = test, predict = "prediction", actual = 'HD', cutoff = .49 )
#ggthemr("flat")
cm_info$plot

Model 2

# Split into 80-20 train-test sets

set.seed(1234) 

sample <- sample.int(n = nrow(total), size = floor(.80*nrow(total)), replace = F)
train <- total[sample, ]
test  <- total[-sample, ]
# Logit Model Training
model2<- glm(HD~ GeneralHealth+Ethnicity, data= train, family="binomial")
summary(model2)

Predicting and Assessing the Model

# Prediction
train$prediction <- predict( model2, newdata = train, type = "response" )
test$prediction  <- predict( model2, newdata = test , type = "response" )
# distribution of the prediction score grouped by known outcome
ggplot( train, aes( prediction, color = as.factor(HeartDisease) ) ) + 
geom_density( size = 2 ) +
ggtitle( "Training Set's Predicted Score" ) + 
scale_color_economist( name = "data", labels = c( "No Heart Disease", "Yes Heart Disease" ) ) + 
theme_economist()

# Positive instance = 1 = Yes Heart Disease 
# Negative instance = 0 = No Heart Disease
# user-defined different cost for false negative and false positive

roc_info2 <- ROCInfo( data = test, 
                     predict = "prediction", 
                               actual = "HD", 
                               cost.fp = 100,
                               cost.fn = 200 )
p <- recordPlot()
dev.off()
grid.draw(roc_info2$plot)
#accuracy_info <- AccuracyCutoffInfo( train = train, test = test, predict = "prediction", actual = 'HeartDisease' )
# define the theme for the next plot
#ggthemr("light")
#accuracy_info$plot
#loadPkg("regclass")
xkabledply( confusion_matrix(model2), title = "Confusion matrix from Logit Model" )
#unloadPkg("regclass")

# 
cm_info2 <- ConfusionMatrixInfo( data = test, predict = "prediction", actual = 'HD', cutoff = 0.27 )
#ggthemr("flat")
cm_info2$plot

#Model 3

# Split into 80-20 train-test sets

set.seed(1234) 

sample <- sample.int(n = nrow(total), size = floor(.80*nrow(total)), replace = F)
train <- total[sample, ]
test  <- total[-sample, ]
# Logit Model Training
model3<- glm(HD~ GeneralHealth+Ethnicity+PhysicalActivity, data= train, family="binomial")
summary(model3)

Predicting and Assessing the Model

# Prediction
train$prediction <- predict( model3, newdata = train, type = "response" )
test$prediction  <- predict( model3, newdata = test , type = "response" )
# distribution of the prediction score grouped by known outcome
ggplot( train, aes( prediction, color = as.factor(HeartDisease) ) ) + 
geom_density( size = 2 ) +
ggtitle( "Training Set's Predicted Score" ) + 
scale_color_economist( name = "data", labels = c( "No Heart Disease", "Yes Heart Disease" ) ) + 
theme_economist()

# Positive instance = 1 = Yes Heart Disease 
# Negative instance = 0 = No Heart Disease
# user-defined different cost for false negative and false positive

roc_info3 <- ROCInfo( data = test, 
                     predict = "prediction", 
                               actual = "HeartDisease", 
                               cost.fp = 100,
                               cost.fn = 200 )
p <- recordPlot()
dev.off()
grid.draw(roc_info3$plot)
### ROC Curve
library(pROC)
roc_svm_test <-plot(roc(as.numeric(test$HD),as.numeric(test$prediction)),main="Comparaison ",col="#1c61b6")
plot(roc_svm_test, add = TRUE,col = "red", print.auc=TRUE, print.auc.x = 0.5, print.auc.y = 0.3)
legend(0.3, 0.2, legend = c("test-glm"), lty = c(1), col = c("blue"))

#accuracy_info <- AccuracyCutoffInfo( train = train, test = test, predict = "prediction", actual = 'HeartDisease' )
# define the theme for the next plot
#ggthemr("light")
#accuracy_info$plot
#loadPkg("regclass")
#xkabledply( confusion_matrix(model3), title = "Confusion matrix from Logit Model" )
#unloadPkg("regclass")

# 
#cm_info2 <- ConfusionMatrixInfo( data = test, predict = "prediction", actual = 'HeartDisease', cutoff = 0.27 )
#ggthemr("flat")
#cm_info2$plot

#Limitations of Dataset: Our dataset contains 18 variables, for the EDA Analysis we have used only a few variables for our analysis and drew some conclusions and we may use all the variables for the final that is building the model.

Next Steps: Next steps are to use what we learned in our dataset and develop a linear model. And try to create a model which is helpful for everyone. As we all know Health is Wealth and without health, we can’t achieve heights so we tend to develop a model which we can use to know which variables are developing heart disease and with that, we can try to decrease the effect on Heart Disease.

References

Ara S. A literature review of cardiovascular disease management programs in managed care populations. Journal of Managed Care Pharmacy 2004; 10(4): 326-344.

Centers for Disease Control and Prevention. Underlying Cause of Death, 1999–2018. CDC WONDER Online Database. Atlanta, GA: Centers for Disease Control and Prevention; 2018.

CDC. (2019, December 2). Heart Disease Facts | cdc.gov. Centers for Disease Control and Prevention.

Giampaoli S, Palmieri L, Mattiello A, et al. Definition of high risk individuals to optimize strategies for primary prevention of cardiovascular diseases. Nutr Metab Cardiovasc Dis 2005;15:79–85.

Here’s What an Extra 20 Minutes of Sleep Does to Your Brain. (2020, July 17). Well+Good. https://www.wellandgood.com/benefits-of-getting-more-sleep/

Hong, Y. (2018). Framingham Heart Study (FHS) [Review of Framingham Heart Study (FHS)]. National Heart, Lung and Blood Institute. https://www.nhlbi.nih.gov/science/framingham-heart-study-fhs

Liau, S. Y., Mohamed Izham, M. I., Hassali, M. A., & Shafie, A. A. (2010). A literature review of the cardiovascular risk-assessment tools: applicability among Asian population. Heart Asia, 2(1), 15–18. https://doi.org/10.1136/ha.2009.001115

Robbins, R., Quan, S. F., Weaver, M. D., Bormes, G., Barger, L. K., & Czeisler, C. A. (2021). Examining sleep deficiency and disturbance and their risk for incident dementia and all-cause mortality in older adults across 5 years in the United States. Aging, 13(3), 3254–3268. https://doi.org/10.18632/aging.202591

Sadick, B. (2019, April). Women Die From Heart Attacks More Often Than Men. Here’s Why — and What Doctors Are Doing About It. Time; Time. https://time.com/5499872/women-heart-disease/

Bethesda, MD: National Institutes of Health. https://www.cdc.gov/stroke/conditions.htm#:~:text=Common%20heart%20disorders%20can%20increase,rich%20blood%20to%20the%20brain.

Antoine Soetewey https://statsandr.com/blog/chi-square-test-of-independence-in-r/

https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/chi-square/